Nature Biotechnology — Latest Matching Preprints

1

Privacy-Preserving Matching for Federated Causal Inference in Multicentre Patient Cohorts

Gusinow, R.; Morgan, A. S.; Canziani, L. M.; Zeitlin, J.; Kim, M.; Gentilotti, E.; Ghosn, J.; Florence, A.-M.; Tami, A.; Toschi, A.; Palacios-Baena, Z. R.; Tacconelli, E.; Hasenauer, J.

2026-07-19 epidemiology 10.64898/2026.07.16.26358171 medRxiv

Top 2%

2.4%

Causal effect estimates in clinical and epidemiological studies are often biased because patient cohorts frequently exhibit substantial covariate imbalances between treated and control groups, and these imbalances are amplified in multicentre studies by heterogeneous recruitment, clinical practice, and case mix. Covariate balancing methods are therefore essential for valid causal inference. However, their application becomes challenging when data are distributed across cohorts and cannot be pooled because of privacy, legal, or institutional constraints, leaving a gap in practical methods for causal effect estimation in federated and imbalanced clinical data settings. We develop a privacy-preserving framework for covariate balancing and causal effect estimation across distributed data providers, combining federated aggregation with differential privacy to enable propensity score subclassification and matching without sharing individual-level records. Matching relies on non-disclosive quantities and differentially private distance evaluation, and the resulting matched subsets remain local to each server. Balance can be assessed through federated diagnostics and privacy-preserving visualisations, and we provide secure estimators for average treatment effects with associated uncertainty quantification. We implement this framework in the DataSHIELD federated analysis platform via two R packages. In simulations, we demonstrate agreement between federated and centralised analyses in the absence of privacy noise and quantify the bias-variance trade-offs induced by differential privacy. We illustrate applicability in two multinational settings (a Long COVID cohort and very preterm birth cohorts), showing that the approach enables practical causal analyses under real-world data protection constraints. The DataSHIELD packages are available on GitHub. Additional methodological details are provided in the Supplementary Material.

2

Nationwide Mpox Genomic Surveillance Reveals Clade Ib Introductions, APOBEC3-Driven Evolution, and Terminal Deletions

Brochu, H. N.; Shi, Q.; Song, K.; Zhang, Q.; Munroe, J.; Harris, N. J.; Britt, N.; Zeng, Q.; Kapuria, K.; Chappell, J.; Norvell, B. M.; Peavy, L.; Williams, J. D.; Harris, A. B.; Chaitram, J.; Hutson, C. L.; Deng, J.; McGrath, D.; Boles, D.; Dale, S. E.; Gigante, C. M.; Iyer, L. K.

2026-07-17 infectious diseases 10.64898/2026.07.15.26357894 medRxiv

Top 2%

1.8%

Background The 2022-2023 global mpox outbreak highlighted the critical need for robust genomic surveillance capabilities to track mpox virus (MPXV) evolution and transmission dynamics. Methods Building upon our established SARS-CoV-2 sequencing infrastructure, we implemented a Molecular Loop probe-based long-read sequencing approach using Pacific Biosciences Sequel II technology for comprehensive MPXV genomic surveillance across the United States (US). From August 2024 to June 2025, we generated 326 high-quality whole genome sequences from residual mpox-positive clinical specimens collected by Labcorp across all 10 US Department of Health and Human Services regions. Results Our analysis identified two samples containing clade Ib MPXV in January and June 2025 and captured shifting trends in clade IIb diversity, with 13 distinct lineages observed. We also identified multiple instances of large (~1.6-17.6kb) deletions proximal to the inverted terminal repeats in clade IIb genomes. APOBEC3 mutation analysis indicated substantial evidence of human-to-human transmission among both clades. Further, we observed significantly higher APOBEC3-associated SNPs per kilobase (P<0.001) in clade IIb genomic variable regions relative to their central conserved region. Our assay exhibited strong reproducibility across biological replicates from individual patients and accuracy was confirmed via parallel sequencing of select specimens by US Centers for Disease Control and Prevention (CDC) using metagenomic sequencing. We also demonstrated via custom simulation that our assay discriminates all known MPXV clades and lineages, including those we have not observed in the US. Conclusions Our integrated nationwide surveillance system facilitates real-time genomic tracking of outbreak evolution, with demonstrated capacity across SARS-CoV-2 and MPXV, positioning this platform for rapid deployment during future pathogen emergence.

3

From amplicon to antigen: a quantified transmission map that nominates multi-antigen antibody-drug-conjugate co-target sets across cancer types

Lam, J. M.; Walker-Samuel, S.; Pennycuick, A.

2026-07-16 oncology 10.64898/2026.07.13.26357987 medRxiv

Top 3%

1.5%

Somatic copy-number amplification is pervasive in cancer, and the genes it carries are candidate drug targets - but only those whose amplification is transmitted to accessible surface protein can be reached by an antibody-drug conjugate (ADC). We build an integrated map of copy-number-to-protein transmission across six tumour types and ask, for every amplified gene, whether its dosage reaches the surface. Copy number transmits to mRNA (median per-gene r = 0.21) but is attenuated at the protein level in 85% of genes, and the mRNA ranking is largely preserved to protein (rho = 0.70); the ranking is set principally at the chromatin/transcription step - among directly measured regulatory inputs, promoter DNA methylation and tumour chromatin accessibility each explain about an order of magnitude more of the transmission variance than gene structure, and do so complementarily. Critically, transmissibility is a stable, gene-intrinsic property: it is predictable from gene properties alone, with no proteomic input, at a leave-gene-out rank correlation of 0.52 (R2 = 0.29); it is not positional (holding out whole chromosome arms changes accuracy by 0.001); and it transfers across lineages (Kendall W = 0.97 across leave-one-lineage-out refits). This licenses a predictor that nominates surface targets in cancer types that lack a tissue-referenced proteome, combining direct protein measurement where it is available with prediction where it is not. 
Requiring co-elevation on a recurrent amplicon with measured transmissibility and an accessible extracellular ectodomain nominates 22 surface antigens on 18 distinct recurrent amplicons across four cancer types (renal, endometrial and both lung subtypes) - for example ITGB8+TSPAN13+TTYH3 on lung 7p, NCSTN+HSD17B7+MPZL1 on 1q (recurrent in several types), the transferrin receptor TFRC on squamous 3q, and FZD1 on clear-cell renal 7q; 21 of the 22 are non-driver passengers and 10 are confirmed on the experimental Cell Surface Protein Atlas. In single malignant cells, against a null that controls for per-cell sequencing depth, the co-detected constructs sit at a modest 1.05-1.45x above independence (p < 0.001, donor-block bootstrap intervals clear of 1.0), and at binding-relevant thresholds the normal-tissue co-expression collapses - so an avidity AND-gate that binds stably only where the antigens co-occur would spare normal cells that carry only one. Observed transmissibility itself transfers strongly between the two lung subtypes (rho = 0.88) and remains positive across distant lineages, consistent with the shared cell-of-origin regulation the map implies. Single-cell co-detection is demonstrated wherever a malignant single-cell atlas exists (both lung subtypes and glioblastoma - the latter entirely from prediction, using no GBM surface-abundance measurement); the remaining cohorts are nominated on the same genetic and topological evidence. The result is a pan-cancer, confidence-tiered catalogue of multi-antigen ADC co-target sets with a concrete plan to test them.

4

The Variance-Stabilizing Transformation for the Poisson Rate Ratio: Closed-Form Confidence Intervals

Ng, S.-P.

2026-07-18 epidemiology 10.64898/2026.07.16.26358255 medRxiv

Top 3%

1.1%

The incidence rate ratio R is the standard measure for comparing event rates in clinical trials and epidemiology. In vaccine trials, the vaccine efficacy is VE = 1 - R. When events are rare, the two arm counts are Poisson. The estimator of R is heteroskedastic: its sampling variance changes with the data. So no fixed-width interval covers correctly everywhere. The usual log-Wald interval is undefined at zero events and covers poorly at small counts. Early vaccine and drug-safety readouts fall in exactly this regime. We show that a single reparameterization collapses this bivariate problem to an effective one-parameter family with a quadratic variance function, whose variance-stabilizing transformation is 2 arcsinh(sqrt(R)). The reduction yields a closed-form confidence interval for R. Its two leading errors, a curvature bias and the variability of the estimated scale, each admit a closed-form correction with no tuning constants. In a Monte Carlo study of our seven arcsinh variants and five competitors, the +Curve+Stu variant covers within 0.002 of the nominal 0.95 for about 50 control and 5 treatment events. Its width is on par with the best competitor. It avoids the conservatism and zero-count breakdown of log-Wald and MOVER. For moderate counts, we recommend this interval; for sparser data, our Bar-Lev and Enis count-shift variant is more robust. The result is a ready-to-use, closed-form interval for the low-count regime. We illustrate it on early Covid-19 vaccine-efficacy readouts and provide reference implementations in R and Python.
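As a rough illustration of the transformation named in this abstract, the sketch below evaluates g(R) = 2 arcsinh(sqrt(R)), its inverse, and a numerical check that its slope equals 1 / sqrt(R (1 + R)), which is what one would expect for a variance function proportional to R (1 + R). The function names are hypothetical, and this is not the authors' confidence interval or any of their corrections, only the transform itself.

```python
import math


def vst(r):
    """Transform g(R) = 2 * arcsinh(sqrt(R)); finite at R = 0, unlike log(R)."""
    return 2.0 * math.asinh(math.sqrt(r))


def vst_inverse(g):
    """Inverse transform: R = sinh(g / 2) ** 2."""
    return math.sinh(g / 2.0) ** 2


def vst_slope(r, h=1e-6):
    """Central-difference estimate of dg/dR (requires r > h)."""
    return (vst(r + h) - vst(r - h)) / (2.0 * h)


# Illustrative check at a few rate ratios (values chosen arbitrarily).
for r in (0.05, 0.5, 2.0):
    analytic = 1.0 / math.sqrt(r * (1.0 + r))
    print(f"R={r}: g={vst(r):.4f}, slope={vst_slope(r):.6f}, analytic={analytic:.6f}")
```

An interval built on the transformed scale would be mapped back to the R scale with `vst_inverse`; the scale and correction terms needed for that are in the preprint, not reproduced here.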

5

CRISPR RNA-independent activation of Cas12a

Iwe, I. A.; Singh, S.; Guan, K.; Ocampo, R. F.; Ribeiro da Silva, S. J.; Wachholz Junior, D.; Emami, N.; Corsano, A.; Zeisler, I.; Bozovicar, K.; Wang, L.; Ham, D.; Cai, R.; Kelly, P.; Zayeni, R.; Nguyen, J.; Bayat, P.; Charania, M.; Palter, S.; Liu, F. X.; Shrestha, S.; Rayhan, A.; Wasney, G. A.; Mazzulli, T.; Green, A. A.; Li, Z.; Yao, S.; Hubbard, B. P.; Taylor, D. W.; Pardee, K.

2026-07-16 primary care research 10.64898/2026.07.14.26358058 medRxiv

Top 6%

0.5%

CRISPR-Cas12a nucleases are classically activated through CRISPR RNA (crRNA) guided and PAM-dependent target recognition, which together establish a canonical heteroduplex associated with nuclease activation. Here we identify a crRNA- and PAM-independent activation pathway for Cas12a that reveals previously unrecognized conformational plasticity within its nucleic acid recognition interface. We show that short RNAs can directly occupy the canonical crRNA-binding channel and trigger a catalytically competent trans cleavage state in the absence of PAM recognition or canonical R-loop formation. Biochemical assays indicate that short RNAs bind the crRNA-binding channel and are competitively displaced by cognate crRNA, consistent with binding at a conserved nucleic acid-binding interface. Cryo-electron microscopy (cryo-EM) further reveals that Cas12a maintains its global catalytic architecture while exhibiting loss of canonical PAM-dependent stabilization and increased flexibility of the RuvC lid, alongside accommodation of a noncanonical RNA-DNA hybrid with inverted polarity relative to the crRNA-target duplex. This crRNA-independent activation pathway enables programmable, amplification-free detection of DNA and RNA targets independent of canonical guide-mediated recognition. Together, these findings define an alternative activation geometry for Cas12a and expand models of Class 2 CRISPR-Cas effector activation beyond crRNA- and PAM-directed recognition.

6

A ReAct Agentic AI System for Natural Language Querying and Statistical Analysis of The Cancer Genome Atlas Clinical Data

Korutla, R.; Amal, S.

2026-07-17 health informatics 10.64898/2026.07.15.26358188 medRxiv

Top 7%

0.4%

The Cancer Genome Atlas (TCGA) holds clinical data for over 11,000 patients across 33 cancer types, but access is hard because of complex file structures, heterogeneous formats, and the need for programming. We present an agentic system for natural language querying and statistical analysis of TCGA clinical data. The system uses a large language model as an autonomous ReAct agent that selects from eight computational tools, including data extraction, descriptive statistics, Kaplan-Meier survival analysis with log-rank tests, hypothesis testing, and verification against the curated TCGA Pan-Cancer Clinical Data Resource (CDR). The agent reasons about intermediate results, adapts its approach, and returns clinically contextualized responses with source attribution and auditable traces. We introduce TCGA-Agent-Bench, 440 queries across five difficulty tiers with ground truth from the independently curated TCGA-CDR, evaluated with dual metrics of numerical accuracy and clinical completeness. The system achieves 93.4% overall accuracy (100% single-patient lookups, 99.1% cohort statistics, 92.8% comparative analyses), outperforming a fixed rule-based pipeline (87.1%), a single-pass LLM (81.8%), and retrieval-augmented generation (66.9% on a subset). Most of the benchmark is answerable from the CDR alone, so we locate the extraction layer's value in fields the CDR lacks (drug treatments, TNM components, biomarkers, biospecimen metadata): on 26 queries targeting these, the full system answers 100% versus 3.8% for CDR-only. Ablations show the reasoning loop is most impactful (+9.1% accuracy, +22.0 completeness points). A tool-based agentic architecture enables accurate, auditable analysis of clinical repositories, with value driven by tool design and recovered fields rather than model scale.

7

CuGen: A GPU-accelerated framework for large-scale genomics

Kiiskinen, T.; Richland, J.; Wang, W.; Lu, W. S.; Balasubramanian, N.; Hastie, T.; Tibshirani, R.; Rivas, M. A.

2026-07-17 genetic and genomic medicine 10.64898/2026.07.15.26358178 medRxiv

Top 8%

0.3%

Biobank-scale genomic analyses remain computationally expensive, CPU-bound workflows, particularly when adjusting for confounding. Here, we present CuGen, a GPU-accelerated framework for large-scale genomics. CuGen uses UltraLasso, a novel hierarchical application of univariate-guided sparse regression (uniLasso), to select a compact, phenotype-informed active set of fewer than 30,000 variants. This achieves robust leave-one-chromosome-out (LOCO) confounding control, enabling both downstream GWAS and in-sample fine-mapping. Additionally, we introduce the .cugen file format, a genotype representation designed for memory-optimized, high-throughput streaming and random access on GPU hardware. Building on this substrate, we provide a general GPU-accelerated genomics toolkit handling polygenic prediction, data manipulation, quality control, analysis, and visualization. We demonstrate CuGen's efficacy in the UK Biobank with up to 408,624 individuals, where the full GWAS pipeline and fine-mapping against 6.8 million imputed variants completes in approximately 10 minutes on a single high-throughput GPU with 80 GB of memory. The pipeline scales efficiently to massive phenome-wide analyses with sublinear resource consumption.

8

FoodScribe: an open-source semantic framework for nutrient estimation from free-text dietary records

Gouda, H.; Sala Climent, M.; Agongo, J.; Gaikwad, S. P.; Nattakom, A.; Zhao, H. N.; Xing, S.; Boland, B. S.; Holt, T.; Guma, M.; Dorrestein, P. C.

2026-07-17 nutrition 10.64898/2026.07.15.26358181 medRxiv

Top 9%

0.2%

Efficiently summarizing dietary records at scale remains a persistent bottleneck in nutritional epidemiology. We present FoodScribe, which translates free-text meal descriptions into quantitative nutrient profiles by combining ingredient parsing with nutrient retrieval from the USDA FoodData Central (FDC) database. Benchmarked with three LLM providers on the Nutribench dataset, FoodScribe annotated 3,807 meal descriptions in 2.5 hours, a task that would otherwise require substantial manual effort from trained nutritionists. FoodScribe achieved F1 scores of 0.79-0.89 across macronutrient estimation, with models performing better for protein than for fat. Application to a Mediterranean diet intervention cohort indicated dietary shifts consistent with the intervention pattern based on model-derived estimates. Integration with metabolomics data suggested that fiber and vegetable intake were positively associated with a fecal metabolite cluster.

9

Efficient stochastic epidemic simulation via the Sellke construction

van Boven, M.; Bootsma, M. C.

2026-07-17 epidemiology 10.64898/2026.07.16.26358219 medRxiv

Top 10%

0.1%

Stochastic epidemic models are a cornerstone of infectious disease epidemiology and are often used to study intervention scenarios. However, large run-to-run variability can make intervention effects difficult to estimate precisely. We revisit the epidemic Sellke construction, which assigns each individual an infection threshold for the cumulative infection hazard such that, conditional on the thresholds, the epidemic trajectory becomes deterministic. This enables coupling of simulations with and without an intervention, yielding low-variance effect estimates even when outcomes such as final size or peak incidence vary widely between runs. We develop an exact, event-driven implementation that maintains infection and recovery events in priority queues. Cumulative infection-hazard updates require O(log N) time per event, yielding overall complexity O(E log N) for E events in a population of size N. The implementation achieves computational performance comparable to the classical Gillespie algorithm while naturally accommodating non-Markovian infectious periods and complex infectiousness profiles. We illustrate the approach using distance-dependent spread of avian influenza between poultry farms in the Netherlands and a multilayer population with households, schools, and workplaces. In both examples, coupling enables efficient within-run comparisons of intervention scenarios across stochastic realisations.
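The coupling idea in this abstract can be sketched for the simplest case. The code below is a minimal illustration and not the authors' implementation: it assumes a homogeneously mixing SIR model, exponential infectious periods, a transmission rate beta > 0, and hypothetical function names. Thresholds and infectious periods are drawn once and reused for both scenarios, so the only difference between the two runs is the transmission rate.

```python
import heapq
import random


def draw_randomness(n, n_init, gamma, seed):
    """Draw Sellke thresholds (Exp(1)) and infectious periods once, up front."""
    rng = random.Random(seed)
    thresholds = sorted(rng.expovariate(1.0) for _ in range(n - n_init))
    periods = [rng.expovariate(gamma) for _ in range(n)]
    return thresholds, periods


def sellke_sir(n, n_init, beta, thresholds, periods):
    """Event-driven SIR run; returns the number of new infections (final size).

    Given thresholds and periods the trajectory is deterministic. Periods are
    assigned in order of infection; recoveries sit in a heap (priority queue).
    """
    recoveries = list(periods[:n_init])
    heapq.heapify(recoveries)
    t, pressure = 0.0, 0.0        # current time and cumulative infection hazard
    infectious, used, k = n_init, n_init, 0
    while infectious > 0:
        rate = beta * infectious / n          # hazard accrual per unit time
        t_rec = recoveries[0]
        if k < len(thresholds):
            t_inf = t + max(0.0, thresholds[k] - pressure) / rate
        else:
            t_inf = float("inf")
        if t_inf < t_rec:                     # next event: infection
            t, pressure = t_inf, thresholds[k]
            heapq.heappush(recoveries, t + periods[used])
            used += 1
            k += 1
            infectious += 1
        else:                                 # next event: recovery
            pressure += rate * (t_rec - t)
            t = heapq.heappop(recoveries)
            infectious -= 1
    return k


# Coupled comparison: same randomness, baseline versus reduced transmission.
thresholds, periods = draw_randomness(n=500, n_init=5, gamma=1.0, seed=42)
baseline = sellke_sir(500, 5, 2.0, thresholds, periods)
intervention = sellke_sir(500, 5, 1.2, thresholds, periods)
print("infections averted in this realisation:", baseline - intervention)
```

Because both runs share the same thresholds and periods, the difference in final size reflects the change in beta rather than independent run-to-run noise.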

10

LocusBlend: Flexible multi-index regional visualization of genomic association signals

Yang, C.; Cook, N.; Zeng, Y.; Fu, T.; Budde, J.; Cruchaga, C.; Belloy, M. E.

2026-07-21 genetic and genomic medicine 10.64898/2026.07.15.26358129 medRxiv

Top 10%

0.1%

Summary: It has become standard practice to visualize regional signals from genome-wide association studies (GWAS) using LocusZoom plots. Similarly, GWAS signals are compared to regionally matched quantitative trait loci (QTLs), i.e., variant-to-gene regulation data, using LocusCompare plots to aid assessment of candidate trait-related genes. Despite broad usage, these tools annotate variants by linkage disequilibrium (LD) to a single lead or index variant. This single-index representation has limitations for visualizing complex loci that contain multiple independent signals. We present LocusBlend, an interactive web application for multi-index, LD-blended visualization of genomic loci. LocusBlend supports one or two genomic association summary-statistic datasets and one to three index variants, multi-index LocusZoom color-blended plots, and matching LocusCompare visualizations. Applications to Alzheimer's disease GWAS and QTL signals illustrate that LocusBlend enables visualization and separation of independent signals despite shared LD and high genomic complexity. Overall, LocusBlend aims to help researchers handle the continuously expanding complexity of human genomics findings. Availability and Implementation: LocusBlend is freely available at https://locusblend.wustl.edu. Publication-ready plots are generated in 1 min. Source code, documentation, example datasets, input templates, and reproducibility instructions are available at https://github.com/BelloyLab/LocusBlend. LocusBlend is implemented in Python using Streamlit, Plotly, and PLINK. Supplementary Information: Supplementary data are available online.

11

Single-cell gene programs define subtype identity and metastatic trajectories in renal cell carcinoma

Madrigal, A.; Kim, M.; Mehrjoo, Z.; Nishimura, T.; Saatci, O.; Osakwe, A.; Zavacky, E.; Moslemi, E.; Glennon, K. I.; Dankner, M.; Maritan, S. M.; Kuasne, H.; Pilon, V.; Monast, A.; Soytas, M.; Arseneault, M.; Oikonomopoulos, S.; Harutyunyan, A.; Lu, T.; Rayes, R.; Soto, L. M.; Hernandez-Corchado, A.; Spicer, J. D.; Petrecca, K.; Siegel, P.; Park, M.; Ragoussis, J.; Sahin, O.; Brimo, F.; Tanguay, S.; Riazalhosseini, Y.; Najafabadi, H. S.

2026-07-16 genetic and genomic medicine 10.64898/2026.07.14.26357682 medRxiv

Top 11%

0.1%

While extensive cellular heterogeneity in renal cell carcinomas (RCC) is linked to diverse clinical outcomes, our understanding of this diversity is limited to those driven by clonal patterns or activity of canonical pathways. Here, we present a compendium of over 85,000 single-cell gene expression profiles from primary and metastatic tumors as well as patient-derived models across four RCC subtypes, including the rare clear cell papillary renal cell tumors, which we show are often misclassified and for which we identify CASP14 as a highly sensitive and specific biomarker. We dissect malignant cell variation within and across tumors using a generative modeling framework that accounts for clonal and copy number-driven expression shifts, defining 59 gene expression programs that deconstruct canonical pathways into functional submodules with divergent activity patterns, distinct regulators, and differential association with clinical outcomes. Despite the canonical view that VHL-deficient clear cell RCC exists in a constitutive pseudohypoxic state, we show strong intra-tumor variability of a hypoxia inducible factor 2 (HIF2)-driven program linked to poor outcome. We also identify early, spatially organized activation of a complete epithelial-to-mesenchymal transition (EMT) program, loss of epithelial identity, and upregulation of protein translation programs as key characteristics of metastatic progression. Finally, a metastatic signature capturing cellular de-differentiation and translational activity identifies primary tumors associated with adverse clinical outcomes. Together, this resource establishes a framework for dissecting malignant cell heterogeneity, refines RCC subtype classification, and defines transcriptional programs underlying metastasis progression.

12

Municipal wastewater surveillance reveals socioeconomic and immigration gradients in antimicrobial resistance across Alberta, Canada

Lee, J.; Gonzalez, C.; Au, E.; Acosta, N.; Waddell, B. J.; Xu, Z. S.; Clark, R. G.; Weyant, R. B.; Dalton, B.; Zaheer, R.; McAllister, T. A.; Barkema, H.; Nobrega, D.; Bhatnagar, S.; Lee, B. E.; Pang, X.; O'Grady, C.; Frankowski, K.; Bertazzon, S.; Conly, J. M.; Hubert, C. R. J.; Parkins, M. D.

2026-07-21 infectious diseases 10.64898/2026.07.19.26358431 medRxiv

Top 11%

0.1%

Antimicrobial resistance (AMR) is an ever-increasing threat to population health. Industrial, environmental and societal factors are increasingly recognized as important contributors to AMR within communities. Here, we investigated the spatial distribution of AMR genes (ARGs) across Alberta, Canada and their association with socio-economic, immigration-related, and agro-industrial characteristics using municipal wastewater-based surveillance. We analyzed monthly wastewater metagenomes collected between March 2022 and March 2023 across eleven municipalities, representing 39% of Alberta's population. Integration with census data enabled multivariate analysis, revealing that municipal resistome profiles were strongly structured along income and immigration-related population gradients. ARGs spanning 14 resistance classes exhibited distinct distributional patterns across income and immigration gradients, including contrasting associations among beta-lactam, aminoglycoside, and macrolide-lincosamide-streptogramin ARGs, consistent with heterogeneous selection pressures across sub-populations. These findings demonstrate the capacity of longitudinal wastewater surveillance to identify persistent population-level resistome patterns and highlight the importance of incorporating sociodemographic context into AMR surveillance and mitigation strategies.

13

Identification of Persistent Radiomics Feature Co-occurrence Across Diverse Tissue Types and Individuals: A Network-Based Analysis of the RADAPT CT Atlas

Amiri, S.; Afshar, P.; Rohban, M. H.

2026-07-19 radiology and imaging 10.64898/2026.07.17.26358252 medRxiv

Top 12%

0.1%

Objectives. Radiomics pipelines extract hundreds of quantitative features that are widely known to be redundant, but the structure of this redundancy is usually treated as a per-dataset nuisance to be pruned away. We tested the alternative hypothesis that a substantial number of feature-feature correlations are universal: they persist across patients and across anatomically distinct structures because they reflect shared mathematical and image-statistical properties of how the image is summarised, rather than properties of the tissue being imaged. Materials and Methods. We re-analysed the publicly available Radiomics Atlas Dataset of normal Abdominal and Pelvic CT (RADAPT), restricting the analysis to the 526 non-contrast-enhanced examinations of the 531-subject atlas and to the 107 original (non-filtered) PyRadiomics features. The 53 segmented structures were grouped into four broad anatomical categories -- bones, muscles, vessels, and parenchymal organs. RADAPT is distributed as one Excel file per structure, with patients as rows and features as columns. Within each structure file we z-score-normalised every feature across patients, computed the absolute Spearman correlation matrix, and retained edges with |ρ| ≥ τ for τ in {0.70, 0.80, 0.90}. We then intersected the edge sets across all structure files to obtain a "universal" correlation graph, in which an edge survives only if it exceeds the threshold in every structure (each estimated across the full patient sample). Stable feature communities were defined as the maximal cliques of this graph. Robustness to patient sampling was tested by repeating the entire pipeline on five independent random splits of each file into two patient halves (10 sub-cohorts per threshold), and the implementation was independently reproduced in R. Results.
Despite the strictness of the global-intersection criterion, 34, 24, and 14 stable feature communities survived at τ = 0.70, 0.80, and 0.90 respectively, with the largest cliques containing six members at τ = 0.70 and τ = 0.80 and five members at τ = 0.90. The community structure was clearly interpretable: separate cliques captured (i) variance-like intensity dispersion, (ii) long-run / low-frequency (coarse) texture, (iii) high gray-level texture, (iv) low gray-level texture, (v) volume and surface shape, and (vi) local-homogeneity and energy/entropy duals. On random-half resampling the exact-match recovery rate of these communities was 81.5%, 86.7%, and 80.7% across the three thresholds; departures from exact recovery were almost always a single boundary feature added or dropped, consistent with finite-sample fluctuation of near-threshold edges rather than structural instability. The R re-implementation reproduced the Python results exactly. Conclusion. A substantial portion of radiomics feature collinearity is universal across patients and tissues. We distinguish two layers within it: trivial near-algebraic duals that are universal by construction, and non-trivial cross-matrix-family communities that are the genuine empirical finding. Together they provide an interpretable, definition-grounded basis for aggressive dimensionality reduction, for retrospectively reconciling apparently different feature selections in the literature, and for moving radiomics pipelines toward organ-agnostic, more reproducible models. Clinical relevance statement. Selecting a single representative feature from each universal community shrinks the original-feature space by roughly an order of magnitude without sacrificing biologically distinct information.
For example, the five variance-family members (first-order Variance, GLCM SumSquares, GLCM ClusterTendency, GLDM and GLRLM GrayLevelVariance) can be replaced by a single representative, removing redundant degrees of freedom that would otherwise inflate model variance; and labelling each retained feature by its community lets two studies that selected different variance-family names be recognised as having found the same signal, simplifying model development and improving cross-cohort generalisability in clinical CT workflows.
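The pipeline described in this abstract (per-structure absolute Spearman correlation, thresholding, intersection of edge sets across structures, then maximal cliques) can be sketched on toy data. This is an illustrative pure-Python version with hypothetical function names and made-up feature values, not the authors' code; it assumes no feature is constant within a structure and keeps only cliques with at least two members. The z-scoring step is omitted because Spearman correlation depends only on ranks.

```python
from itertools import combinations


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def edges_above(table, tau):
    """Feature pairs whose absolute Spearman correlation is at least tau."""
    return {(u, v) for u, v in combinations(sorted(table), 2)
            if abs(spearman(table[u], table[v])) >= tau}


def maximal_cliques(nodes, edges):
    """Bron-Kerbosch enumeration of maximal cliques (no pivoting)."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(tuple(sorted(r)))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(nodes), set())
    return found


def universal_cliques(structures, tau):
    """Cliques (size >= 2) of the graph of edges present in every structure."""
    edge_sets = [edges_above(table, tau) for table in structures.values()]
    universal = set.intersection(*edge_sets)
    nodes = sorted(next(iter(structures.values())))
    return sorted(c for c in maximal_cliques(nodes, universal) if len(c) >= 2)


# Toy data: two "structures", four features, six patients each (made up).
toy = {
    "structure_1": {"a": [1, 2, 3, 4, 5, 6], "b": [2, 4, 5, 8, 9, 12],
                    "c": [6, 5, 4, 3, 2, 1], "d": [3, 6, 1, 5, 2, 4]},
    "structure_2": {"a": [1, 2, 3, 4, 5, 6], "b": [1, 3, 2, 4, 5, 6],
                    "c": [3, 6, 1, 5, 2, 4], "d": [6, 5, 4, 3, 2, 1]},
}
print(universal_cliques(toy, tau=0.70))
```

In the toy data only the pair (a, b) clears the threshold in both structures, so it is the only edge that survives the intersection.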

14

Nocturnal cough as a syndromic surveillance signal for respiratory illness in England

Irons, T.; Carlsson, E.; Tang, M. L.; Mellor, J.; Rubin, C.; Allen, A.; Elliot, A. J.; Kageback, M.; Packham, J.

2026-07-21 epidemiology 10.64898/2026.07.20.26357937 medRxiv

Top 13%

0.1%

We evaluated aggregated, privacy-preserving smartphone-detected nocturnal cough activity from the Sleep Cycle application as a potential syndromic surveillance signal in England. Weekly cough metrics from January 2023 to January 2026 were compared with UK Health Security Agency indicators: NHS 111 acute respiratory infection (ARI) triage calls, influenza and COVID-19 PCR positivity, and hospital admission rates for influenza, COVID-19, and respiratory syncytial virus. We evaluated total cough counts alongside two population-normalised metrics, coughs per user and coughs per hour of sleep, and assessed temporal relationships nationally and regionally using cross-correlation with prewhitening. The strongest and most consistent associations were observed for NHS 111 ARI triage calls, where population-normalised cough metrics showed raw national correlations of approximately 0.95 and retained prewhitened correlations above 0.55 at lag 0. This indicates that nocturnal cough activity closely tracks short-term variation in an established syndromic surveillance indicator, beyond shared seasonality, long-term trends, and autocorrelation. Similar near-contemporaneous patterns were observed across regions. Population-normalised cough metrics also showed epidemiologically plausible leading associations with pathogen-specific indicators: coughs per hour of sleep peaked one week before influenza PCR positivity, while both coughs per user and coughs per hour of sleep peaked one week before COVID-19 PCR positivity. Hospital-based indicators showed weaker and more heterogeneous relationships, but the normalised cough metrics still showed plausible temporal alignment with influenza and COVID-19 admissions, including contemporaneous associations with influenza admissions and short leading associations with COVID-19 admissions. 
In contrast, unnormalised total cough counts produced less stable and often non-interpretable lag structures, consistent with sensitivity to variation in observation volume. These findings suggest that passive, near-real-time nocturnal cough monitoring can provide a population-level signal of respiratory symptom burden, with greatest utility as a broad syndromic indicator that complements surveillance sources affected by healthcare-seeking behaviour, laboratory turnaround times, backfilling, and reporting delays.

15

FootNet: A Multi-View Smartphone Dataset and Four-Model Benchmark for Clinical Foot Segmentation

Vijay, A.; Prabhune, A.; Srihari, V. R.; Rayampalli, A.

2026-07-17 health informatics 10.64898/2026.07.15.26358117 medRxiv

Top 13%

0.1%

We present FootNet, a 453-image multi-view smartphone foot dataset for binary foot segmentation, with expert-annotated masks across six anatomical views (dorsal, medial, and plantar, both left and right). We benchmark four segmentation models under a controlled protocol: U-Net with a MobileNetV2 encoder achieves the best performance (IoU 0.9268, Dice 0.9608, 95% CI [0.9209, 0.9320]); DeepLabV3 with MobileNetV3-Large scores IoU 0.8984 (Dice 0.9449); UNet++ with MobileNetV2 scores IoU 0.8913 (Dice 0.9391); and SAM ViT-B with an oracle bounding-box prompt scores IoU 0.9219 on the matched 191-image subset. Bonferroni-corrected Wilcoxon signed-rank tests (k = 6 comparisons) show U-Net significantly outperforms DeepLab (p < 0.001, r = 0.638) and SAM ViT-B with an oracle bounding-box prompt (p = 0.005, r = 0.202); UNet++ does not significantly differ from DeepLab (p = 0.062). Connected-component postprocessing yields negligible benefit (mean ΔIoU = +0.0003, 12 of 453 images improved). The extended dataset is available upon request.
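
The two overlap metrics reported in this abstract, IoU and Dice, can be computed for a single pair of binary masks as sketched below. This is an illustrative helper under stated assumptions, not the benchmark's evaluation code: masks are assumed to be equal-shaped nested lists of 0/1, the function name is hypothetical, and treating two empty masks as perfect agreement is a convention chosen here.

```python
def iou_and_dice(pred, truth):
    """Return (IoU, Dice) for two binary masks given as equal-shaped nested lists."""
    inter = union = pred_sum = truth_sum = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            p, t = bool(p), bool(t)
            inter += p and t      # pixel is foreground in both masks
            union += p or t       # pixel is foreground in at least one mask
            pred_sum += p
            truth_sum += t
    if union == 0:
        return 1.0, 1.0           # both masks empty (assumed convention)
    return inter / union, 2 * inter / (pred_sum + truth_sum)


# Toy 2x3 masks: 2 shared foreground pixels, 4 in the union.
pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
iou, dice = iou_and_dice(pred, truth)  # iou = 0.5, dice = 2/3
```

For a single image the two metrics are linked by Dice = 2·IoU / (1 + IoU); dataset-level means, as reported in the abstract, need not satisfy that identity exactly because they are averaged per image.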

16

ReCo: a self-configuring and self-extending agentic framework for biomedical research

Tzanis, E.; Klontzas, M. E.

2026-07-16 health informatics 10.64898/2026.07.14.26358025 medRxiv

Top 13%

0.1%

This study presents ReCo (Research Cosmos), a self-configuring and self-extending agentic research framework for the biomedical domain. ReCo is orchestrated by a large language model that interacts with native computing tools, bundled Model Context Protocol (MCP) servers, structured skills, persistent project memory, and a desktop interface. Its bundled MCP servers provide biomedical analysis capabilities while serving as implementation paradigms for integrating new computational and AI frameworks. Structured skills encode procedures for environment configuration and framework ingestion, enabling ReCo to inspect repositories, manuscripts, or local codebases; identify dependencies and execution patterns; create isolated runtime environments; and design and implement MCP interfaces. Self-extension was evaluated using five heterogeneous systems: the Merlin computed tomography foundation model, the MAISI-v2 medical image synthesis framework, the asari liquid chromatography-mass spectrometry workflow, the DosimeTron agentic radiation-dosimetry platform, and the Orthanc DICOM server. ReCo successfully operationalized all five systems and completed predefined functional evaluations. Re-hosted DosimeTron outputs demonstrated near-perfect agreement with the reference pipeline across 651 organ observations (Pearson correlation and Lin concordance correlation coefficient, 0.99999; mean absolute percentage difference, 0.37%). Notably, ReCo configured Orthanc as a PACS-like coordination layer, integrated it with DosimeTron, Merlin, and TotalSegmentator, and orchestrated data retrieval, analysis, and return of valid DICOM RTSTRUCT, RTDOSE, and Structured Report objects. ReCo provides a unified environment for configuring, documenting, and operationalizing heterogeneous biomedical frameworks, reducing technical barriers to the adoption and integration of emerging computational and AI methods. The official open-source ReCo GitHub repository is available at: https://github.com/eltzanis/ReCo
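
The agreement statistic cited in this abstract, Lin's concordance correlation coefficient (CCC), can be sketched as below. This is a generic illustration of the standard formula using population (1/n) moments, not code from ReCo or DosimeTron, and the function name is hypothetical.

```python
def lin_ccc(x, y):
    """Lin's concordance correlation coefficient for paired measurements."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired sequences of length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # The squared mean difference penalises systematic offset between methods.
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


reference = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]            # same shape, constant offset of 1
ccc_same = lin_ccc(reference, reference)   # 1.0
ccc_shifted = lin_ccc(reference, shifted)  # 4/7, although Pearson r would be 1
```

Unlike Pearson correlation, CCC drops below 1 when one method is systematically shifted or rescaled relative to the other, so reporting both at 0.99999 indicates agreement in value and not only in trend.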

17

How bursty infectiousness shapes epidemic dynamics

Kissler, S. M.

2026-07-17 epidemiology 10.64898/2026.07.15.26358199 medRxiv

Top 14%

0.1%

An epidemic's expected course is determined by the magnitude and timing of a typical person's infectiousness, captured in turn by the basic reproduction number and the generation-time distribution. These fundamental, population-average quantities can mask individual-level variation that shapes how an epidemic actually unfolds: for example, individual variation in the magnitude of infectiousness (overdispersion) creates superspreading, a key feature of the SARS-CoV-1 and SARS-CoV-2 epidemics. However, the impact of individual variation in infectiousness timing is less well understood. Here, we demonstrate that individual infectiousness timing varies substantially and to different degrees across pathogens. For some common pathogens, including influenza, measles, and SARS-CoV-2, infectiousness is "bursty", or highly concentrated and variably timed across individuals: for example, the window of appreciable infectiousness for SARS-CoV-2 may last for roughly a day, vs. the 9–12 days usually quoted. We show that bursty infectiousness creates superspreading without inherent superspreaders, makes epidemic timing more variable, amplifies the time-sensitivity of common interventions, and complicates inference of key epidemiological parameters. Together with the reproduction number, the generation-time distribution, and overdispersion, burstiness completes a family of basic parameters that govern how epidemics unfold.

18

Longitudinal multiomic network rewiring at the complement-coagulation interface in post-acute sequelae of COVID-19 (PASC)

Ward, B.; Belkhir, L.; Balligand, J.-L.; Cani, P. D.; De Greef, J.; Dewulf, J. P.; Gatto, L.; Haufroid, V.; Kabamba, B.; Vertommen, D.; Yombi, J. C.; Elens, L.; Bommer, G.; Bamps, L.

2026-07-16 infectious diseases 10.64898/2026.07.14.26358048 medRxiv

Top 15%

0.1%

Background. Post-acute sequelae of COVID-19 (PASC) is clinically heterogeneous and mechanistically unresolved, and single-analyte studies have struggled to explain it. Methods. We profiled matched plasma proteomics, metabolomics and whole-blood transcriptomics at acute infection and convalescence (mean 86 days later) in a Belgian cohort, using linear mixed models, multiomic gene-set enrichment, and a degree-matched differential-correlation approach to quantify how each node's interactions were rewired between patients who developed PASC and those who recovered; seven axis proteins were additionally quantified by multiplex immunoassay as orthogonal validation. Findings. Single-omic testing yielded few FDR-significant features, yet multiomic enrichment showed sustained complement cascade involvement from acute illness to follow-up in PASC. Correlation networks re-organised topologically toward C3 and lost the immunoglobulin V gene coexpression seen in recovery. The most rewired nodes, heparin cofactor II (SERPIND1), alpha-1 antitrypsin (SERPINA1), complement factor H-related 5 (CFHR5), prothrombin/thrombin (F2) and immunoglobulin V gene transcripts (notably IGLV3-21), changed in their co-expression structure rather than in abundance. In multiplex validation, acute CRP was elevated in patients who developed PASC (FDR = 0.012), whereas the directly measured abundances of the network-nominated proteins were unchanged. Interpretation. These trajectory-aware, cross-omic networks nominate a thrombo-inflammatory axis in which complement and coagulation regulation remain dysregulated in PASC at the level of wiring rather than abundance, providing a systems framework for validation and for exploring interventions at the complement-coagulation-platelet interface.

19

Human in vivo immunology of tuberculosis is not affected by sex dimorphism.

Jiang, J.; Greenan-Barrett, J.; Gupta, R. K.; Noursadeghi, M.; Turner, C. T.

2026-07-21 infectious diseases 10.64898/2026.07.20.26358462 medRxiv

Top 15%

0.1%

Males incur greater risk of tuberculosis (TB) than females, but the contribution of sex-associated immune differences remains unclear. We addressed this using sex-stratified transcriptomic analyses across four independent studies spanning active pulmonary TB, subclinical TB and latent infection, in peripheral blood, bronchoalveolar lavage (BAL) and by using the tuberculin skin test (TST) as a standardised in vivo antigenic challenge. In blood of active TB patients, expression of TNF- and type I interferon-regulated signatures, genome-wide gene expression, and performance of leading host-response biomarkers of TB were comparable between sexes. Similarly, blood transcriptomic biomarkers showed no meaningful sex-related differences for predicting asymptomatic or incident TB. In the TST of people with latent infection, bulk and single-cell RNA sequencing identified only limited differences, largely restricted to sex chromosome-linked transcripts, with no consistent evidence of dimorphism in immune-regulated pathways. Single-cell RNA sequencing of BAL samples identified reduced abundance of B cells in male TB patients, with gene expression differences again largely restricted to sex chromosome-linked transcripts. These findings suggest that canonical immune responses associated with TB are broadly similar between the sexes, and that increased TB risk among males more likely reflects differential exposure rather than intrinsic immunological susceptibility.

20

Gradient-guided adapter merging for neuroimaging vision-language models

Bit, S.; Guney, O. B.; Jia, S.; Kolachalama, V. B.

2026-07-21 health informatics 10.64898/2026.07.18.26358397 medRxiv

Top 16%

0.1%

Automated interpretation of neuroimaging studies requires simultaneous assessment of multiple imaging evidence variables, each tied to distinct anatomical structures. Vision-language models (VLMs) offer a unified framework for multi-task analysis, but adapting pre-trained VLMs remains challenging. Full fine-tuning is computationally prohibitive, and joint multi-task training requires simultaneous access to all task data, which is often infeasible in clinical settings. Although model merging enables multi-task composition without joint re-training, existing methods focus on post-hoc algorithms with limited extension to VLMs and minimal application to neuroimaging. Here, we present GRadient-guided Adapter Merging (GRAM), a layer-selective low-rank adaptation (LoRA)-based fine-tuning and merging framework for multi-task neuroimaging visual question-answering (VQA). GRAM uses a gradient ratio that contrasts class-specific gradients to identify task-discriminative layers, and applies subspace-constrained projected gradient descent to restrict LoRA updates to directions consistent with the geometry of the pre-trained model. We leveraged a structured VQA benchmark, developed from the National Alzheimer's Coordinating Center (NACC) dataset, that pairs multi-sequence brain MRI studies with question-answer pairs across clinically relevant imaging evidence variables. Experiments on the VQA benchmark showed that GRAM outperformed or matched all-layer LoRA fine-tuning and a standard merging baseline while reducing inter-task interference during merging, and approached or surpassed the performance of joint multi-task training without joint re-training.